Integrating spatial transcriptomics and bulk RNA-seq: predicting gene expression with enhanced resolution through graph attention networks

Abstract: Spatial transcriptomics data play a crucial role in cancer research, providing a nuanced understanding of the spatial organization of gene expression within tumor tissues. Unraveling the spatial dynamics of gene expression can unveil key insights into tumor heterogeneity and aid in identifying potential therapeutic targets. However, in many large-scale cancer studies, spatial transcriptomics data are limited, with bulk RNA-seq and corresponding Whole Slide Image (WSI) data being more common (e.g. the TCGA project). To address this gap, there is a critical need to develop methodologies that can estimate gene expression at near-cell (spot) level resolution from existing WSI and bulk RNA-seq data. This approach is essential for reanalyzing expansive cohort studies and uncovering novel biomarkers that were overlooked in the initial assessments. In this study, we present STGAT (Spatial Transcriptomics Graph Attention Network), a novel approach leveraging Graph Attention Networks (GAT) to discern spatial dependencies among spots. Trained on spatial transcriptomics data, STGAT is designed to estimate gene expression profiles at spot-level resolution and predict whether each spot represents tumor or non-tumor tissue, especially in patient samples where only WSI and bulk RNA-seq data are available. Comprehensive tests on two breast cancer spatial transcriptomics datasets demonstrated that STGAT outperformed existing methods in accurately predicting gene expression. Further analyses using the TCGA breast cancer dataset revealed that gene expression estimated from tumor-only spots (predicted by STGAT) provides more accurate molecular signatures for breast cancer subtype and tumor stage prediction, and also leads to improved patient survival and disease-free analysis. Availability: Code is available at https://github.com/compbiolabucf/STGAT.


Section S1: Evaluation Metrics
1. Correlation score: The Pearson correlation coefficient between the true and predicted gene expression profiles was used as the evaluation metric for comparing STGAT with the baseline methods, following the formula:

$$\rho = \frac{\sum_{i} (y_i - \mu_i)(\hat{y}_i - \hat{\mu}_i)}{\sqrt{\sum_{i} (y_i - \mu_i)^2}\,\sqrt{\sum_{i} (\hat{y}_i - \hat{\mu}_i)^2}}$$

where $y_i$ is the true gene expression of a spot and $\hat{y}_i$ is the corresponding predicted gene expression. $\mu_i$ and $\hat{\mu}_i$ are their mean gene expressions, respectively. The correlation coefficient reveals the strength and direction of the linear relationship between two sets of values. Therefore, it is a good indicator of how similar the predicted gene expression values are to the true gene expression values.

2. Mean Squared Error (MSE): This metric is also used to compare the performance of the models. It is computed between the true and predicted gene expression profiles following the formula:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

where $\hat{Y}_i$ and $Y_i$ are the predicted and true gene expressions of a patient sample, and $n$ is the number of values being compared. MSE measures the average squared difference between the two sets of values, so it indicates how far the predicted gene expression values deviate from the true ones. Together, the correlation coefficient and MSE capture how close the predicted data are to the true data, and thereby allow the performance of the models to be compared.
3. Area Under the Receiver Operating Characteristic curve (AUROC): This metric is used to compare performance on the classification tasks on the TCGA data. It is defined as the area under the curve obtained by plotting the True Positive Rate (sensitivity, or recall) along the y-axis against the False Positive Rate (1 - specificity) along the x-axis. It was computed using the scikit-learn [1] Python package.
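The three metrics in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the STGAT implementation: the function names (`pearson_correlation`, `mean_squared_error`, `auroc`) and the toy inputs are hypothetical, expression profiles are assumed to be flat lists of numbers, and class labels are assumed to be coded as 0/1. The hand-written AUROC only illustrates the definition (area under the TPR-versus-FPR curve, integrated with the trapezoidal rule); it is not the scikit-learn routine itself.

```python
import math


def pearson_correlation(y_true, y_pred):
    """Pearson correlation between two equal-length expression vectors."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    n = len(y_true)
    mu_true = sum(y_true) / n
    mu_pred = sum(y_pred) / n
    cov = sum((a - mu_true) * (b - mu_pred) for a, b in zip(y_true, y_pred))
    ss_true = sum((a - mu_true) ** 2 for a in y_true)
    ss_pred = sum((b - mu_pred) ** 2 for b in y_pred)
    denom = math.sqrt(ss_true * ss_pred)
    # A constant vector has zero variance, so the correlation is undefined.
    return cov / denom if denom > 0 else float("nan")


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def auroc(labels, scores):
    """Area under the ROC curve (TPR on y-axis, FPR on x-axis)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Sweep the decision threshold from the highest score downwards.
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    points = [(0.0, 0.0)]  # (FPR, TPR) pairs along the curve
    tp = fp = 0
    k = 0
    while k < len(order):
        threshold = scores[order[k]]
        # Samples tied at the same score cross the threshold together.
        while k < len(order) and scores[order[k]] == threshold:
            if labels[order[k]] == 1:
                tp += 1
            else:
                fp += 1
            k += 1
        points.append((fp / n_neg, tp / n_pos))
    # Trapezoidal integration of TPR over FPR.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy inputs (hypothetical values, for illustration only).
true_expr = [1.0, 2.0, 3.0, 4.0]
pred_expr = [2.0, 4.0, 6.0, 8.0]
r = pearson_correlation(true_expr, pred_expr)
mse = mean_squared_error(true_expr, pred_expr)

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auroc(labels, scores)
```

The toy example also shows why the two regression metrics are reported together: the predictions are perfectly correlated with the truth yet still have a non-zero MSE, because correlation ignores scale while MSE does not. For real analyses, `sklearn.metrics.roc_auc_score` is the scikit-learn function that computes AUROC.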
Section S2: Tables
Length of a convolution embedding vector for a single spot
$X_{ij}$: Spot image at the $j$-th position of the $i$-th spatial image
Embedding vector generated from a CNN block for a single spot
Concatenated embedding of all the spots of image $i$

Figure S4: Correlation coefficient between true and generated gene expression on test samples with varying numbers of gene profiles used.

Figure S5: Ablation study of the STGAT framework. Performance comparison includes STGAT without the GAT layer in the SEG module (STGAT - GAT), STGAT without bulk RNA-seq gene expression guidance in the GEP module (STGAT - bulk), STGAT without the z-score normalization step in the GEP module (STGAT - norm), and the complete STGAT framework.
Adjacency matrix for image $X_i$
$N_{ij}$: Set of neighbors for the $j$-th spot in the $i$-th image
$\alpha^{n}_{ijk}$: Neighbor attention coefficient for the $k$-th neighbor of the $j$-th spot of the $i$-th image
$a^{s}, a^{n} \in \mathbb{R}^{e_a}$: Self and neighbor attention vectors
$W^{s}, W^{n} \in \mathbb{R}^{e_a \times e_c}$: Self and neighbor weight matrices
$E$: Concatenation of the embeddings generated by all the heads for the $j$-th spot
$Y_i \in \mathbb{R}^{g_i \times p}$: Output prediction matrix for the $i$-th image

Figure S1: Comparison between STGAT and the baselines in terms of Mean Squared Error (MSE) loss computed between the predicted and true spot-level gene expression on the 'breast cancer dataset'.

Figure S2: Comparison between STGAT and the baselines in terms of Mean Squared Error (MSE) loss computed between the predicted and true spot-level gene expression on the 'HER2+ dataset'.

Figure S3: KEGG pathways with the highest mean Pearson correlation between the true and STGAT-predicted gene expression profiles. The bubble size represents the number of genes contained in a pathway, and the color shades represent the standard deviation of the correlation across the test samples.
$\in \mathbb{R}^{g_i \times e_c}$: Embedding produced by the linear block for image $i$